Design aspects of COVID-19 treatment trials: Improving probability and time of favourable events

As a reaction to the pandemic of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a multitude of clinical trials for the treatment of SARS-CoV-2 or the resulting corona disease (COVID-19) are globally at various stages from planning to completion. Although some attempts were made to standardize study designs, this was hindered by the ferocity of the pandemic and the need to set up trials quickly. We take the view that a successful treatment of COVID-19 patients (i) increases the probability of a recovery or improvement within a certain time interval, say 28 days; (ii) aims to expedite favourable events within this time frame; and (iii) does not increase mortality over this time period. On this background we discuss the choice of endpoint and its analysis. Furthermore, we consider consequences of this choice for other design aspects including sample size and power and provide some guidance on the application of adaptive designs in this particular context.


Introduction
At the time of writing, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic is ongoing. As a reaction to the pandemic, a multitude of clinical trials on the treatment of SARS-CoV-2 or the resulting corona disease (COVID-19) are globally in planning, were recently initiated or have already been completed. Although some attempts were made to standardize study designs, this was hindered by the ferocity of the pandemic and the need to set up trials quickly.
For randomized controlled trials evaluating the safety and efficacy of COVID-19 treatments, the discussion on appropriate outcomes has received considerable attention in the meantime. For instance, Dodd et al. (2020) discuss the use of survival methodology to investigate both an increase of the event probability of a favorable outcome such as improvement or recovery and its timing. Similar to McCaw et al. (2020b), Dodd et al. stress that such outcomes are subject to competing risks or competing events, because patients may die without having achieved the favorable outcome. The authors argue that these patients must not be censored at their time of death but, say, at day 28 after treatment, if the trial investigates a 28-day follow-up. The authors go on to advocate investigating hazard ratios based on such data (called improvement or recovery rate ratio by the authors, because the outcome is not hazardous to health, but favorable). Furthermore, Benkeser et al. (2020), an early methodological publication on COVID-19, consider time-to-event outcomes in a paper advocating covariate adjustment in randomized trials, but do not consider competing events. Rather, the authors consider a composite of intubation and death, thereby avoiding the need to model competing events. This composite combines two unfavorable outcomes, but for the outcomes recovery or improvement, such a composite endpoint that also includes death is not meaningful.
The role of censoring here is subtle: Dodd et al. (2020) state that the improvement or recovery rate ratio approach coincides with the subdistribution hazard ratio approach of Fine and Gray (1999), if there is additional, 'usual' censoring as a consequence of staggered study entry. McCaw et al. (2020b), on the other hand, characterize the approach to censor previous deaths at day 28, still assuming a 28-day-follow-up, as 'unusual' and warn against the use of one minus Kaplan-Meier, if there is additional censoring before day 28, say, because of staggered entry. In our presentation of motivating examples in Section 2, we will see that both Kaplan-Meier estimates censoring at the time of death and Kaplan-Meier estimates censoring at day 28 are being used in COVID-19 trials.
Recently, Kahan et al. (2020) discussed outcomes in the light of the estimand framework. One aim of this paper is to clarify and provide guidance with respect to the estimands at hand when using survival methodology to investigate both an increase of the event probability of a favorable outcome and its timing. To this end, we will demonstrate that censoring deaths on day 28 in a trial with a 28-day follow-up conceptually corresponds to formalizing time to improvement or recovery via improper failure times, which we will call subdistribution times, with probability mass at infinity. The latter corresponds to the probability of death during 28-day or, more generally, τ-day follow-up. This has various consequences: It allows formalizing mean and median times to improvement (recovery). The Kaplan-Meier estimator based on data with deaths censored on day τ will coincide with the Aalen-Johansen estimator of the cumulative event probability considering the competing event death, provided that there is no additional censoring. The hazard ratio at hand will be a subdistribution hazard ratio as a consequence of using subdistribution times, but not as a consequence of additional censoring.
In this manuscript, we take the view that a successful treatment of COVID-19 patients (i) increases the probability of a recovery or improvement within a certain time interval, say 28 days; (ii) aims to expedite recovery or improvement within this time frame; and (iii) does not increase mortality over this time period (see e.g. Wilt et al. (2020)). This should be reflected in the main outcomes of a COVID-19 treatment trial. The choice of outcomes also has some implications for the trial design. Firstly, even in traditional designs the sample size calculation might be complicated by the presence of competing events. Secondly, novel trial designs including platform trials and adaptive group-sequential designs are more frequently applied than usual in COVID-19 treatment trials. Stallard et al. (2020) provide an overview of such designs, discuss their utility in COVID-19 trials and make some recommendations. In the light of the outcome discussion, we provide some comments on the application of such outcomes in adaptive designs.
The manuscript is organized as follows. In Section 2 background on some example trials is provided to motivate the investigations presented here. In Section 3 outcomes, their analysis and interpretation are considered before some guidance is provided on planning such trials in Section 4. We close with a brief discussion in Section 5.

Motivating examples
Our starting point is that a successful treatment of COVID-19 patients (i) increases the proportion of recoveries within a time interval [0, τ ], say, τ = 28 days; (ii) aims to expedite recovery on [0, τ ]; and (iii) does not increase mortality at time τ . Aim (i) is obviously desirable both from a patient's perspective and from a public health perspective. The rationale behind aim (ii) is that two different treatments that lead to comparable recovery proportions at time τ may differ in the timing of recoveries. Here, faster recovery is not only desirable from a public health perspective with respect to available resources, but faster recovery from ventilation will also benefit the individual patient. Finally, requirement (iii) reflects that a treatment that increases both the proportion of recoveries and the proportion of deaths at time τ benefits some patients and harms others.
We will argue that aims (i)-(iii) cast COVID-19 trials into a competing events (or competing risks) setting, although this is not necessarily or not explicitly recognized. For example, the primary clinical endpoint of Wang et al. (2020) was time to clinical improvement within 28 days after randomisation, addressing aims (i) and (ii). Within τ = 28 days, 13% of the patients in the placebo group and 14% in the treatment (remdesivir) group died. The authors aimed to address such competing mortality before clinical improvement by right-censoring time to clinical improvement at τ for patients dying before τ. The authors then used the usual machinery of Kaplan-Meier, log-rank and Cox proportional hazards regression. However, as we will see below, their analysis amounts to using the Aalen-Johansen estimator of the cumulative event probability (instead of Kaplan-Meier), Gray's test for comparing cumulative event probabilities (or subdistributions) between groups (instead of the common log-rank test) and Fine and Gray's proportional subdistribution hazards model (instead of the usual Cox model) (Beyersmann et al., 2012).
Other recent examples are Beigel et al. (2020) who consider time to recovery on days [0, 28] and Cao et al. (2020) whose primary outcome is time to clinical improvement until day 28. Beigel et al. also censor previous deaths 'on the last observation day' (see the Appendix of Beigel et al.) and use Kaplan-Meier (here, actually, Aalen-Johansen) and log-rank test (here, actually, Gray's test) to analyse these data. Cao et al. censor both 'failure to reach clinical improvement or death before day 28' on day 28 and use Kaplan-Meier, log-rank and the Cox model (here, actually, Fine and Gray). Interestingly, Cao et al. comment that 'right-censoring occurs when an event may have occurred after the last time a person was under observation, but the specific timing of the event is unknown', although this is clearly not the case for patients censored at day 28 following death before that time.
To be precise, let ϑ be the time to clinical improvement, using the example of Wang et al. Time to improvement is, in general, not well defined for patients dying prior to improvement. To address this, the subdistribution time is defined as ϑ = ∞ for the latter patients. The interpretation of the improper random variable ϑ is that it equals the actual time of improvement when ϑ < ∞. However, patients who die before improvement will never experience this primary outcome and, hence, ϑ = ∞. The censored subdistribution time in the paper of Wang et al. becomes
$$\tilde{\vartheta} = \min(\vartheta, \tau).$$
Writing T for the time to improvement (ε = 1) or death (ε = 2), whatever comes first, one minus the Kaplan-Meier estimator based on the censored $\tilde{\vartheta}$ data equals the Aalen-Johansen estimator of the cumulative improvement probability
$$P(T \le t, \varepsilon = 1).$$
This has been documented elsewhere (Geskus, 2011) but it is most easily seen for the present case of censoring at one common τ. We provide the calculation in the Appendix. In the previous display, T is the time-to-first-event and ε is the type-of-first-event, i.e., P(T ≤ t, ε = 1) is the cumulative event probability or the so-called cumulative incidence function of the type 1 event. Censoring at one common τ is a simple case of 'censoring complete' data (Fine and Gray, 1999) which we exploit in the Appendix. 'Censoring complete' means that $\tilde{\vartheta}$ is also known for those observed to die before improvement. 'Observed to die' means that death has occurred before censoring. Such patients will not be further followed-up, and the future censoring time becomes latent and is, in general, not known. This complicates general technical developments in the subdistribution framework (Fine and Gray, 1999), but here the censored subdistribution times are known, see our Appendix. The hazard 'attached' to ϑ neither equals the all-events hazard of T nor the event-specific hazards of T and either ε = 1 or ε = 2 (Beyersmann et al., 2012). One important consequence is that using standard log-rank software on the censored subdistribution times gives Gray's test for comparing event probabilities P(T ≤ t, ε = 1) between groups, and using standard proportional hazards software employs Fine and Gray's proportional subdistribution hazards model, casting these analyses into a competing events setting.
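The coincidence of one minus Kaplan-Meier on the censored subdistribution times with the Aalen-Johansen estimator can be checked numerically. The following is a minimal sketch, assuming complete follow-up on [0, τ]; the toy data and the helper `one_minus_km` are hypothetical and serve illustration only. Under complete follow-up, the Aalen-Johansen estimator reduces to the empirical proportion of patients with improvement by time t.

```python
import numpy as np

def one_minus_km(times, events, t):
    """One minus the Kaplan-Meier estimator at time t.

    times  : observed (possibly censored) times
    events : 1 if the observation is an event, 0 if it is censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv = 1.0
    for u in np.unique(times[(events == 1) & (times <= t)]):
        at_risk = np.sum(times >= u)                 # risk set just prior to u
        d = np.sum((times == u) & (events == 1))     # events exactly at u
        surv *= 1.0 - d / at_risk
    return 1.0 - surv

# Hypothetical toy data: time to first event T and its type
# (1 = improvement, 2 = death, 0 = neither event by day tau)
tau = 28
T = np.array([3, 5, 5, 8, 10, 12, 15, 20, 28, 28])
eps = np.array([1, 2, 1, 1, 2, 1, 1, 2, 0, 0])

# Censored subdistribution times: deaths and event-free patients are censored at tau
theta_tilde = np.where(eps == 1, T, tau)
improved = (eps == 1).astype(int)

for t in (5, 10, 20, 28):
    km = one_minus_km(theta_tilde, improved, t)
    empirical = np.mean((T <= t) & (eps == 1))   # Aalen-Johansen under complete follow-up
    print(t, round(km, 4), round(empirical, 4))
```

Both columns agree at every time point because dead patients remain in the risk set until τ, so each Kaplan-Meier factor removes exactly the newly improved patients from the full sample.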
We will investigate the consequences of competing events in the following sections, including alternatives to the subdistribution framework, the need to still analyse competing mortality and possible strategies to account for death after recovery. Here, we stress that the aim to account for our items i)-iii) above has led authors to implicitly employ a competing events analysis, although this is not explicitly acknowledged. One worry is that the subdistribution hazards framework at hand has repeatedly been re-examined, questioning the interpretability of a hazard belonging to an improper random variable (Andersen and Keiding, 2012).
The key issue here is that patients are still kept 'at risk' after death and until censoring at τ , although, of course, no further events will be observed for these patients.
There are further examples of the presence of competing events in COVID-19 studies. For instance, Grein et al. (2020) use Kaplan-Meier for time-to-clinical improvement at day 28. These authors report a Kaplan-Meier estimate of 84% for the cumulative improvement probability at τ = 28, although improvement was only observed for 36 (68%) out of 53 patients. Letters to the Editor and a Reply by Grein et al. reveal that the original Kaplan-Meier analysis had censored deaths before improvement at the time of death, but not at τ . It is well known that such an analysis is subject to 'competing risks bias' and must inevitably overestimate cumulative event probabilities (Beyersmann et al., 2012).
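The mechanism behind such a discrepancy can be illustrated with a small sketch. The data below are hypothetical (they are not the data of Grein et al.), and the helper name is an assumption made for illustration. Deaths are censored at the time of death, which removes them from the risk set and inflates one minus Kaplan-Meier relative to the observed improvement proportion.

```python
import numpy as np

def one_minus_km(times, events, t):
    """One minus Kaplan-Meier at time t; events == 0 marks censored observations."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv = 1.0
    for u in np.unique(times[(events == 1) & (times <= t)]):
        surv *= 1.0 - np.sum((times == u) & (events == 1)) / np.sum(times >= u)
    return 1.0 - surv

# Hypothetical toy data with complete follow-up until day 28
# (1 = improvement, 2 = death before improvement, 0 = neither by day 28)
T = np.array([3, 5, 5, 8, 10, 12, 15, 20, 28, 28])
eps = np.array([1, 2, 1, 1, 2, 1, 1, 2, 0, 0])

# Deaths censored at the time of death: the analysis subject to competing risks bias
naive = one_minus_km(T, (eps == 1).astype(int), 28)
observed = np.mean(eps == 1)   # proportion of patients with observed improvement

print(naive, observed)         # the naive estimate exceeds the observed proportion
```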
As the last example of this section we consider ventilator-free days (VFDs), see Yehya et al. (2019), which are the primary or secondary outcome in a number of ongoing trials (trial identifiers NCT04360876, NCT04315948, NCT04348656, NCT04357730, NCT03042143, NCT04372628, NCT04389580 at clinicaltrials.gov and DRKS00021238 at clinicaltrialsregister.eu). Yehya et al. provide an applied tutorial on using VFDs as an outcome measure in respiratory medicine. Similar to our aims (i)-(iii) above, they argue in favour of using VFDs, because they 'penalize nonsurvivors', that using time as an outcome 'provide[s] greater statistical power to detect a treatment effect than the binary outcome measure' and that time is relevant in that 'shortened ventilator duration is clinically and economically meaningful'. Again, censored subdistribution times are present, although this is not made explicit in the definition of VFDs. Choosing once more a time horizon of τ = 28 days, VFDs are defined as 28 − x if ventilation stops on day x, but are defined as 0 if the patient either dies while being ventilated or is still alive and ventilated after [0, 28]. Interpreting the subdistribution time ϑ as the day when ventilation is stopped while alive (and not as a consequence of death), we get VFDs = 28 − $\tilde{\vartheta}$, where $\tilde{\vartheta}$ is ϑ censored at day 28, and Yehya et al. consequently also suggest the proportional subdistribution hazards model as one possible statistical analysis.
Our main model will be the competing events model in Section 3.1, considering only the solid arrows in Figure 1. Improvement or recovery is modelled by a 0 → 1 transition, death without prior improvement (or recovery) is modelled by a 0 → 2 transition. Later, in Section 3.3, we will briefly consider an extension of this model to an illness-death model without recovery by also considering 1 → 2 transitions, i.e., death after improvement events. This is illustrated in Figure 1 by the dashed arrow.
Note that 'illness-death without recovery' does not mean that recovery may not be modelled, but that 1 → 0 transitions are not considered. In terms of outcomes, X(t) = 1 in the competing events model means that improvement has occurred on [0, t], but in the illness-death model, X(t) = 1 means that improvement has occurred and that the patient is still alive. The distinction may be relevant for trials in patients where possible subsequent death on [0, τ ] is a concern, see Sommer et al. (2018) for a discussion of death after clinical cure in treatment trials for severe infectious diseases.
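The definition of ventilator-free days discussed in this section can be written as a short function of the subdistribution time. This is a minimal sketch; the function name and the convention of passing `None` for patients who die while ventilated or are still ventilated at day τ are assumptions made for illustration.

```python
def ventilator_free_days(stop_day, tau=28):
    """VFDs on [0, tau].

    stop_day : day on which ventilation was stopped while alive, or None if
               the patient died while ventilated or was still ventilated at tau
    """
    theta = float("inf") if stop_day is None else stop_day   # subdistribution time
    theta_tilde = min(theta, tau)                            # censored at tau
    return tau - theta_tilde

print(ventilator_free_days(10))     # ventilation stopped on day 10
print(ventilator_free_days(None))   # death on ventilator or still ventilated at day 28
```

Patients who die on the ventilator and patients still ventilated at day 28 both receive zero VFDs, which mirrors censoring the subdistribution time at τ.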

Competing events: time to and type of first event
Both in the competing events and, later, in the illness-death model, time-to-first-event T is the waiting time in state 0 of Figure 1, with type-of-first-event ε = X(T), the state the process enters upon leaving the initial state. The tuple (T, X(T)) defines a competing events situation. Note that competing events are characterized by time-to-first-event and type-of-first-event; it is not assumed that there are no further events after a first event. However, the analysis of subsequent events requires more complex models such as an illness-death model. The stochastic process for time and type of the first event is regulated by the event- (or cause-) specific hazards
$$\alpha_{0j}(t) = \lim_{\Delta t \searrow 0} \frac{P(T \in [t, t + \Delta t), X(T) = j \mid T \ge t)}{\Delta t}, \quad j = 1, 2, \tag{3}$$
which we assume to exist. Their sum is the usual all-events hazard of time T with survival function
$$P(T > t) = \exp\left(-\int_0^t \{\alpha_{01}(u) + \alpha_{02}(u)\}\, du\right).$$
Note that T is the time until a composite of improvement or death, whatever comes first, which is not a meaningful outcome in the present setting, combining an endpoint that benefits the patient with one that harms the patient. Rather, as discussed above, authors consider the cumulative improvement probability
$$P(T \le t, X(T) = 1) = \int_0^t P(T \ge u)\, \alpha_{01}(u)\, du = P(\vartheta \le t) \tag{4}$$
with subdistribution time ϑ until improvement defined as
$$\vartheta = \begin{cases} T, & \text{if } X(T) = 1, \\ \infty, & \text{if } X(T) = 2. \end{cases}$$
A binary outcome, say improvement status at time τ, is covered by this framework,
$$1(\vartheta \le \tau) = 1(T \le \tau, X(T) = 1),$$
with indicator function 1(·) and improvement probability (at time τ) P(T ≤ τ, X(T) = 1). However, viewing quantity (4) as a function of time allows detection of earlier improvement with possibly comparable improvement probabilities at τ. The expected proportion of deaths at τ without prior improvement is P(T ≤ τ, X(T) = 2). Key quantities of the competing events model are time and type of first event, (T, X(T)), and the event-specific hazards (α_01(t), α_02(t)). The subdistribution time ϑ appears to be little more than an afterthought of (4). However, its relevance is closely connected to the subdistribution hazard which neither equals any of the event-specific hazards nor the all-events hazard.
It also reappears in the context of 'average' improvement or recovery times discussed towards the end of the present Subsection.
The subdistribution hazard λ(t) is the hazard 'attached' to (4) by requiring
$$P(T \le t, X(T) = 1) = 1 - \exp\left(-\int_0^t \lambda(u)\, du\right),$$
leading (Beyersmann et al., 2012) to
$$\lambda(t) = \frac{P(T \ge t)}{1 - P(T \le t, X(T) = 1)}\, \alpha_{01}(t), \tag{7}$$
which illustrates why interpretation of the subdistribution hazard as a hazard has been subject of debate (Andersen and Keiding, 2012). The event-specific hazards α_0j(t) have the interpretation of an instantaneous 'risk' of a type j event at time t given one still is in state 0 just prior to time t. They may be visualized as forces moving along the solid arrows in Figure 1. There is no such interpretation for the subdistribution hazard. The above display also illustrates that a competing subdistribution hazard, 'attached' to P(T ≤ t, X(T) = 2), may not be chosen or modelled freely. This is in contrast to the event-specific hazards. The rationale of the subdistribution hazard is that it reestablishes a one-to-one correspondence with the cumulative event probability (4), which is a function of both event-specific hazards α_01(t) and α_02(t). The subdistribution hazard approach may also be viewed via a transformation model for (4) using the link x → log(− log(1 − x)) as in a Cox model for the all-events hazard, but still without a common hazard interpretation. Several authors have argued in favor of link functions which are more amenable to interpretation such as a logistic link. See also Section 4.1 on study planning with respect to odds ratios. Against the background of the motivating examples in Section 2, we will put a certain emphasis on the former link but also note that it is not uncommon that results from either link function coincide from a practical point of view (Beyersmann and Scheike, 2014). In the absence of competing events, this has been well documented for studies with short follow-up and low cumulative event probability (Annesi et al., 1989).
In the present setting, trials will aim at increasing cumulative improvement or recovery probabilities, but 'competing' mortality implies that these probabilities must be below one, which distinguishes competing events from the all-events framework.
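The relation between the subdistribution hazard and the event-specific hazards can be checked numerically in a simple special case. The sketch below assumes constant event-specific hazards with hypothetical values; it integrates the subdistribution hazard over [0, 28] and compares the implied probability with the cumulative improvement probability computed directly.

```python
import numpy as np

a01, a02 = 0.08, 0.02   # hypothetical constant event-specific hazards per day

def cif1(t):
    """P(T <= t, X(T) = 1) for constant event-specific hazards."""
    a = a01 + a02
    return a01 / a * (1.0 - np.exp(-a * t))

def subdist_hazard(t):
    """Subdistribution hazard: P(T >= t) * a01 / (1 - P(T <= t, X(T) = 1))."""
    return np.exp(-(a01 + a02) * t) * a01 / (1.0 - cif1(t))

# Integrate the subdistribution hazard on [0, 28] with the trapezoidal rule
grid = np.linspace(0.0, 28.0, 20001)
lam = subdist_hazard(grid)
cum_lam = np.sum((lam[1:] + lam[:-1]) / 2.0 * np.diff(grid))

print(1.0 - np.exp(-cum_lam), cif1(28.0))   # the two values should agree closely
print(subdist_hazard(28.0) < a01)           # the subdistribution hazard lies below a01
```

In this example the subdistribution hazard decreases over time although both event-specific hazards are constant, because patients who have died remain in the denominator.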
Finally, the subdistribution time ϑ is useful for formalizing 'average' improvement or recovery times. Assuming 'competing' mortality, i.e., P(T ≤ τ, X(T) = 2) > 0, it is easy to see that
$$E(\vartheta) = \infty,$$
because P(ϑ = ∞) = P(X(T) = 2) > 0. Consequently, the expected or mean time to improvement (recovery) is not a useful parameter. It is well known that in standard survival analysis (time-to-all-causes-death), expected survival time is typically not investigated, but for a different reason. For the latter, expected survival time is a finite number, but it is usually not identifiable, at least not nonparametrically, because of limited follow-up. Both in this and in the present context, two possible solutions for investigating 'average' times-to-event are restricted means and median times. For the former, Andersen (2013) considers
$$\int_0^\tau P(T \le u, X(T) = 1)\, du = \tau - E(\min(\vartheta, \tau)). \tag{8}$$
If the competing events are different causes of death, Andersen interprets (8) as the mean time span lost before time τ and 'due to cause 1'. The area under the cumulative event probability may hence be interpreted as the mean time to improvement (recovery) before time τ. For COVID-19 trials, this parameter has recently been suggested by McCaw et al. (2020a), however, without giving formulae or making the link to subdistribution times explicit. This is also related to the ventilator-free days discussed in Section 2. A recent example of using (8) is Hao et al. (2020) who considered influenza-attributable life years lost before the age of 90. Alternatively, one may consider the median time to improvement,
$$\inf\{t : P(T \le t, X(T) = 1) \ge 0.5\}. \tag{9}$$
Again, there is a conceptual difference to median survival time in that the latter always is a finite number (denying the possibility of immortality), but quantity (9) will be defined as infinity, if the eventual improvement probability does not reach 50%.
However, if its Aalen-Johansen estimator (see Subsection 3.2 below) does reach 50% on [0, τ ], the median time-to-improvement can be estimated nonparametrically by plug-in of the Aalen-Johansen estimator in (9), see Beyersmann and Schumacher (2008) for technical details. Use of (9) as an end point in COVID-19 treatment trials accounting for competing events has recently been considered by McCaw et al. (2020b). Study planning of some recent randomized clinical trials on treatment of COVID-19 was also based on assumptions on median times to clinical improvement (Cao et al. (2020), Li et al. (2020)). This will be discussed in Section 4.1. We will use improvement and recovery interchangeably from a methodological perspective as examples for favourable events (although clinically they are of course different).
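Both 'average' time parameters can be computed by plug-in of the estimated cumulative improvement probability. The following is a minimal sketch, assuming complete follow-up on [0, τ] so that the Aalen-Johansen estimator reduces to empirical proportions; the function name and the toy data are hypothetical.

```python
import numpy as np

def restricted_area_and_median(T, eps, tau):
    """Area under the empirical cumulative improvement probability on [0, tau]
    and plug-in median time to improvement.

    eps: 1 = improvement, 2 = death before improvement, 0 = neither by tau.
    """
    T = np.asarray(T, dtype=float)
    eps = np.asarray(eps)
    theta_tilde = np.where(eps == 1, np.minimum(T, tau), tau)   # min(theta, tau)
    area = tau - theta_tilde.mean()                             # tau - mean of min(theta, tau)
    imp_times = np.sort(T[eps == 1])
    cif = np.arange(1, imp_times.size + 1) / T.size             # step heights at improvement times
    reached = cif >= 0.5
    median = imp_times[np.argmax(reached)] if reached.any() else np.inf
    return area, median

# Hypothetical toy data, tau = 28 days
T = [3, 5, 5, 8, 10, 12, 15, 20, 28, 28]
eps = [1, 2, 1, 1, 2, 1, 1, 2, 0, 0]
print(restricted_area_and_median(T, eps, 28))
```

If fewer than half of the patients improve on [0, τ], the function returns an infinite median, in line with the convention that the median is infinite when the improvement probability does not reach 50%.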

Statistical approaches
We will assume that follow-up data are complete in that improvement status and vital status are known for all patients on [0, τ]. In most of the motivating examples of Section 2, both patients alive but without improvement up to day τ and patients who had died were censored at τ. Although censored at this maximal time point, both improvement status and vital status are known for these patients on [0, τ]. Survival methodology discussed below will allow for right-censoring of patients alive where further follow-up information ceases at the time of censoring, but we assume that this is a minor problem when, e.g., τ = 28 days. Assuming data to be complete in this sense, the Kaplan-Meier estimator of P(T > t) is
$$\hat{P}(T > t) = \prod_{u \le t} \left(1 - \frac{\Delta N(u)}{Y(u)}\right),$$
where the product in the above display is over all unique event times u ≤ t, N(t) is the number of composite events (transitions out of the initial state in Figure 1) on [0, t] with increments ∆N(u), Y(t) is the number at risk just prior to time t and n is the sample size. Because of complete follow-up on [0, τ], the estimator of P(T > t) equals the empirical event-free fraction (n − N(t))/n. Writing ∆N_0j(t) for the number of type j events precisely at time t, the Aalen-Johansen estimators are
$$\hat{P}(T \le t, X(T) = j) = \sum_{u \le t} \hat{P}(T > u-)\, \frac{\Delta N_{0j}(u)}{Y(u)}, \quad j = 1, 2,$$
which also equal the usual empirical proportions assuming complete follow-up. Here, one easily sees a natural balance equation,
$$\hat{P}(T > t) + \hat{P}(T \le t, X(T) = 1) + \hat{P}(T \le t, X(T) = 2) = 1,$$
which is maintained even in the presence of censoring, but violated if one were to use
$$1 - \prod_{u \le t} \left(1 - \frac{\Delta N_{01}(u)}{Y(u)}\right) \tag{10}$$
to estimate the cumulative probability of a type 1 event, see the motivating examples of Section 2. This Kaplan-Meier-type estimator inevitably overestimates, the reason being that one minus Kaplan-Meier approximates an empirical distribution function, but the cumulative probability of a type 1 event is bounded from above by P(X(T) = 1).
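The balance between the Kaplan-Meier and Aalen-Johansen estimators, and its violation by Kaplan-Meier-type estimators that censor the competing event, can be illustrated in the presence of right-censoring. This is a minimal sketch with hypothetical toy data and hypothetical function names, written for clarity rather than efficiency.

```python
import numpy as np

def km_and_aj(times, status, t):
    """Kaplan-Meier estimate of P(T > t) and Aalen-Johansen estimates of
    P(T <= t, X(T) = j), j = 1, 2.

    status: 0 = right-censored, 1 = type 1 event, 2 = type 2 event.
    """
    times = np.asarray(times, dtype=float)
    status = np.asarray(status)
    surv, cif = 1.0, {1: 0.0, 2: 0.0}
    for u in np.unique(times[(status > 0) & (times <= t)]):
        y = np.sum(times >= u)                          # at risk just prior to u
        for j in (1, 2):
            d_j = np.sum((times == u) & (status == j))
            cif[j] += surv * d_j / y                    # surv is the estimate just prior to u
        surv *= 1.0 - np.sum((times == u) & (status > 0)) / y
    return surv, cif[1], cif[2]

def naive_one_minus_km(times, status, j, t):
    """Biased Kaplan-Meier-type estimator: events of the other type are
    treated as censored at their event time."""
    times = np.asarray(times, dtype=float)
    status = np.asarray(status)
    surv = 1.0
    for u in np.unique(times[(status == j) & (times <= t)]):
        surv *= 1.0 - np.sum((times == u) & (status == j)) / np.sum(times >= u)
    return 1.0 - surv

# Hypothetical toy data with some right-censoring before day 28
times = [2, 4, 5, 6, 7, 9, 11, 14, 16, 20]
status = [1, 2, 0, 1, 1, 0, 2, 1, 2, 0]

surv, cif1, cif2 = km_and_aj(times, status, 28)
naive1 = naive_one_minus_km(times, status, 1, 28)
naive2 = naive_one_minus_km(times, status, 2, 28)

print(surv + cif1 + cif2)        # balance: equals 1 up to rounding
print(surv + naive1 + naive2)    # exceeds 1 in this example
print(naive1 > cif1)             # the naive estimator overestimates
```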
However, the Kaplan-Meier estimator predominantly used in the motivating examples of Section 2 is based on subdistribution times with censoring time τ, leading to
$$1 - \prod_{u \le t} \left(1 - \frac{\Delta \tilde{N}_{01}(u)}{\tilde{Y}(u)}\right),$$
where Ñ_01(t) and Ỹ(t) denote the number of observed type 1 events and the number at risk based on the censored subdistribution times. We have Ñ_01(t) = N_01(t), but Ỹ(t) ≥ Y(t), because censoring dead patients at τ enlarges the risk set by the number of previous deaths. In the current setting, it is easy to demonstrate that
$$\hat{P}(T \le t, X(T) = 1) = 1 - \prod_{u \le t} \left(1 - \frac{\Delta N_{01}(u)}{\tilde{Y}(u)}\right), \tag{11}$$
see the Appendix. Note that the difference between the right hand side of (11) and the biased Kaplan-Meier-type estimator (10) lies in the use of a different risk set.
Any regression model for hazards may be fit to the event-specific hazards, the most common choice being Cox models,
$$\alpha_{0j}(t; Z) = \alpha_{0j;0}(t) \exp(\beta_{0j}^\top Z), \quad j = 1, 2,$$
with event-specific baseline hazards α_0j;0(t), event-specific p × 1 vectors of regression coefficients β_0j and a p × 1 vector of baseline covariates Z. Technically, an event-specific Cox model for the type 1 hazard, say, may be fit by only counting type 1 events as events and by additionally censoring type 2 events at the time of the type 2 event. Roles reverse when fitting an event-specific Cox model for the type 2 hazard. Interpretationally, this has arguably been a source of confusion, because the biased Kaplan-Meier-type estimator (10) also only counts type 1 events as events and additionally censors type 2 events. The difference between fitting an event-specific Cox model and the Kaplan-Meier-type estimator (10) is that hazard models allow for quite general censoring processes including censoring by a competing event. However, probabilities depend on all event-specific hazards, which is why we have formulated Cox models for all event types above. It is, however, not uncommon to only see results from one event-specific Cox model being reported, see Goldman et al. (2020); Spinner et al. (2020) for two recent examples from COVID-19 treatment trials. In contrast to, e.g., these two studies, event-specific Cox models have not been used in the motivating examples above.
Rather, a Cox-type model for the subdistribution hazard, the Fine and Gray model, has been employed,
$$\lambda(t; Z) = \lambda_0(t) \exp(\gamma^\top Z),$$
where λ_0(t) denotes a baseline subdistribution hazard and γ a p × 1 vector of regression coefficients. If the cumulative improvement probabilities follow the Fine and Gray model, a subdistribution hazard ratio larger than one for treatment signals both an increase of the expected improvement proportion at τ and earlier improvement.
It has been repeatedly argued that any competing events analysis should consider all competing events at hand. For the event-specific hazards, we have therefore formulated two Cox models. For the Fine and Gray approach, postulating a Cox-type model for the 'competing' subdistribution hazard is complicated by (7). However, delayed death on [0, τ] does not benefit the patient if τ = 28 days. Hence, in the present setting, it will suffice to analyse the probability P(X(T) = 2) of 'competing' mortality by common methods for proportions.

Death after improvement or recovery: illness-death model
For instance, McCaw et al. (2020b) broach the issue of longer follow-up in future COVID-19 treatment trials and its impact on meaningful outcomes including time-to-death. Here, one aspect is that prolonged survival on [0, τ], where, e.g., τ is 28 days, does not benefit patients (Tan, 2020). The aim of the present subsection is to briefly outline how the competing events framework may be extended to also handle death events possibly after improvement or recovery during a longer follow-up. To this end, define for finite times t the transition hazard
$$\alpha_{12}(t; \vartheta) = \lim_{\Delta t \searrow 0} \frac{P(X(t + \Delta t) = 2 \mid X(t-) = 1, \vartheta)}{\Delta t}, \quad \vartheta < t,$$
where we now also model 1 → 2 transitions along the dashed arrow in Figure 1. The model has recently been used to jointly model time-to-progression (not a favorable outcome, of course) or progression-free survival and overall survival by Meller et al. (2019). The model is time-inhomogeneous Markov, if α_12(t; ϑ) does not depend on the finite value of ϑ. Again a proportional hazards model may be fit to the transition hazard, possibly also modelling departures from the Markov assumption, but the interpretation of probabilities arguably is more accessible. One possible outcome could be the probability to be alive after recovery, i.e., P(X(t) = 1) over relevant time regions. In the context of clinical trials such outcomes have recently been advocated by Sommer et al. (2018) for treatment trials for severe infectious diseases and by Bluhmki et al. (2020) for patients after stem cell transplantation whose health statuses may switch between favorable and less favorable. Schmidt et al. (2020) have recently used such a multistate model in a retrospective cohort study on COVID-19 patients, modelling oxygenation and intensive care statuses. For the statistical analysis, the authors used both Cox models of the transition hazards and reported estimated state occupation probabilities and 'average' occupation times.
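The difference between the cumulative improvement probability of the competing events model and the probability to be alive after recovery in the illness-death model can be made concrete with empirical proportions. The sketch below assumes complete follow-up up to the time points considered; the function name and the four hypothetical patients are for illustration only.

```python
import numpy as np

def prob_alive_after_improvement(improve_time, death_time, t):
    """Empirical P(X(t) = 1): improvement on [0, t] and still alive at t.

    improve_time : time of improvement, np.inf if no improvement was observed
    death_time   : time of death, np.inf if no death was observed
    """
    improve_time = np.asarray(improve_time, dtype=float)
    death_time = np.asarray(death_time, dtype=float)
    return np.mean((improve_time <= t) & (death_time > t))

# Hypothetical patients: the third patient improves on day 10 and dies on day 20
improve_time = [5, np.inf, 10, 7]
death_time = [np.inf, 12, 20, np.inf]

for t in (15, 25):
    ever_improved = np.mean(np.asarray(improve_time) <= t)   # competing events view
    alive_improved = prob_alive_after_improvement(improve_time, death_time, t)
    print(t, ever_improved, alive_improved)
```

The proportion of patients who ever improved cannot decrease over time, whereas the proportion alive after improvement drops once a patient dies after improvement.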

Some design considerations
Following on from the consideration of the choice of outcomes, their analysis and interpretation in COVID-19 trials, we now look into the consequences a particular choice of outcome has for the design of the trial. We start with sample size considerations and then comment on the use of adaptive designs.

Power and sample size considerations
For a randomized clinical trial investigating the effect of a COVID-19 treatment on clinical improvement or recovery of patients or death, various approaches are conceivable. As described above, the time horizon considered is usually short, often 28 days. So, it can be assumed that the recording of the outcomes of interest such as hospitalization, ventilation, clinical symptoms, and death is complete. In this situation, an ordered categorical endpoint at a prespecified time point (e.g. 28 days) can be used, such as the eight-point ordinal scale proposed in the master protocol of the WHO (WHO, 2020), which was used in a seven-point version in the trial by Goldman et al. (2020). Alternatively, a simpler binary endpoint such as death or clinical recovery, defined by a dichotomized version of the ordinal scale, can be used, as done by Lee et al. (2020). An ordered categorical endpoint might be analyzed, under the proportional odds assumption, with a proportional odds model, for which sample size planning can be based on the formula proposed by Whitehead (1993). Under more general assumptions, the treatment groups might be compared with respect to an ordinal outcome by a nonparametric rank-based approach using e.g. the Wilcoxon rank sum test and the so-called probabilistic index or relative effect (Kieser et al., 2013). The sample size can then be calculated using the formula provided by Noether (1987) or subsequent refinements using the variance under the alternative (Vollandt and Horn, 1997) or extensions to a variety of alternative hypotheses (Happ et al., 2019). A binary endpoint would usually be analyzed with a logistic regression model for which sample size planning can be based on formula (2) in Hsieh et al. (1998).
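As an illustration of rank-based sample size planning, the formula of Noether (1987) for the Wilcoxon rank sum test can be sketched as follows. The function name, default values and the planning value of the probabilistic index are hypothetical; the formula uses the variance under the null hypothesis without a correction for ties, so for ordinal scales with few categories it should be read as a rough approximation only.

```python
from math import ceil
from statistics import NormalDist

def noether_total_n(p, alpha=0.05, power=0.9, alloc=0.5):
    """Approximate total sample size for a two-sided Wilcoxon rank sum test.

    p     : assumed probabilistic index under the alternative (0.5 = no effect)
    alloc : fraction of patients allocated to the first group
    """
    z = NormalDist().inv_cdf
    n = (z(1 - alpha / 2) + z(power)) ** 2 / (12 * alloc * (1 - alloc) * (p - 0.5) ** 2)
    return ceil(n)

# Hypothetical planning value: probabilistic index of 0.6, 1:1 allocation
print(noether_total_n(0.6))
```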
Even if the recording of the endpoint of interest can be assumed to be complete, it may be desirable to analyze not just the occurrence of the endpoint within the specified time period, but the time to the occurrence of the endpoint, mainly for two reasons. First, as described in Section 2, a time-to-event analysis captures not only a difference between treatments with respect to the proportion of patients for whom the event had occurred, but also a difference between treatments with respect to the time of occurrence. This can be relevant even on a short time interval when the endpoint is e.g. time under mechanical ventilation, which has adverse effects on patients' health the longer it is required. Second, even if completeness of data over the time period of interest is assumed, individual patients might be lost to follow-up, which can be handled by a time-to-event analysis being able to include censored observations.
In time-to-event analyses we can model the effect of a treatment on the (event-specific) hazard (3) or on the cumulative event probability (4) of experiencing the event. For the planning of clinical trials with time-to-event endpoints one has to distinguish whether the event of interest is all-encompassing in the sense that every patient will experience it at some point in time (although potentially after study end), e.g. all-cause mortality, or whether the observation of the event of interest, e.g. improvement or recovery, may be precluded by competing events, e.g. death without prior recovery.
In the first case, we do not need to make a decision, whether we are mainly interested in the effect of treatment on the hazard or on the cumulative event probability, as comparisons with respect to hazards, usually performed by the logrank test or the Cox proportional hazards regression model, and comparisons with respect to cumulative event probabilities, usually estimated by the Kaplan-Meier method, are equivalent in this situation.
In the second case of an interesting event for which competing events exist, however, the one-to-one correspondence between hazard and cumulative event probability no longer holds. In this situation, not one hazard but two hazards, the so-called event-specific hazards, one for the interesting event and one for the competing event, exist. As described in Section 3.1 the cumulative event probability (4) depends on both event-specific hazards. An analysis of the treatment effect on the event-specific hazards consists of two analyses, one for each event-specific hazard. The analyses of the event-specific hazards can be performed with Cox proportional event-specific hazards regression models where in each analysis the time to the interesting event is censored at the time when the competing event occurs. For the analysis of the treatment effect on the cumulative event probability, the most popular method is the Fine and Gray model (Fine and Gray, 1999) which is a proportional hazards model for the subdistribution hazard which is the hazard 'attached' to the cumulative event probability as described in Section 3.1.
Because the cumulative event probability depends on both event-specific hazards, the following conclusions can immediately be drawn (Beyersmann et al., 2012). If treatment as compared to control leads to a decrease (or increase) in the cumulative probability of the interesting event, this can have two reasons. It can be due to a direct (e.g. physiological) effect of treatment on the event-specific hazard of the interesting event, or it can be due to an increase (or decrease) that the treatment induces in the event-specific hazard of the competing event. Based on the analysis of the cumulative probability of the interesting event alone, it is difficult to understand the treatment mechanism leading to a difference in event probabilities between treatment and control groups, since various treatment mechanisms can lead to the same difference in event probabilities. As a consequence, it is usually recommended to conduct three analyses for a complete understanding of treatment mechanisms, namely comparisons between treatment and control with respect to the event-specific hazard of the interesting event, the event-specific hazard of the competing event, and the cumulative event probability of the interesting event (Latouche et al., 2013).
In the planning of a clinical trial, one usually has to pre-specify one treatment effect to be analyzed by one primary analysis (Baayen et al., 2019). In the following we discuss the different approaches of focusing on the event-specific hazard or on the cumulative event probability for the situation of our competing events model in Figure 1 where the interesting event is recovery from COVID-19 and the competing event is death without prior recovery.
For a comparison of treatment groups with respect to the event-specific hazards, the parameter of interest is the event-specific hazard ratio
$$\theta_{ES} = \frac{\alpha_{01T}(t)}{\alpha_{01C}(t)},$$
with α 01T (t) denoting the event-specific hazard of the treatment group, and α 01C (t) denoting the event-specific hazard of the control group. For a comparison of treatment groups with respect to the cumulative probability of the interesting event, the parameter of interest is the subdistribution hazard ratio
$$\theta_{SD} = \frac{\log\{1 - F_{1T}(t)\}}{\log\{1 - F_{1C}(t)\}},$$
which follows from (6) under the assumption of proportional subdistribution hazard functions, with F 1T (t) denoting the cumulative probability of the interesting event in the treatment group, and F 1C (t) denoting the cumulative probability of the interesting event in the control group.
In our situation, where the event of interest is recovery, i.e. something favourable, for both quantities θ ES and θ SD superiority of treatment versus control is represented by a value larger than 1.
Whatever the planned analysis, i.e. analysis of the event-specific hazard ratio θ ES or analysis of the subdistribution hazard ratio θ SD , sample size planning for a two-sided level α test with power 1 − β under an assumed hazard ratio θ is typically based on the Schoenfeld formula (Schoenfeld, 1981; Latouche et al., 2004; Ohneberg and Schumacher, 2014; Tai et al., 2018) for the total number of required recovery events
$$E = \frac{(u_{1-\alpha/2} + u_{1-\beta})^2}{p\,(1-p)\,(\log\theta)^2},$$
with p denoting the probability of being in treatment group T, and u 1−γ denoting the (1 − γ)-quantile of the standard normal distribution. The total number of patients to be randomized can then be calculated as N = E/Ψ, where Ψ denotes the probability of observing a recovery event. In the absence of censoring, as assumed in our situation of a short planned trial duration of, say, 28 days, Ψ can be calculated as
$$\Psi = p\,F_{1T}(28) + (1-p)\,F_{1C}(28).$$
Although the same formula for sample size calculation is often used for the analysis of the event-specific hazard ratio and for the analysis of the subdistribution hazard ratio, sample size planning, statistical analyses, and interpretation of results are different, as θ ES and θ SD represent different parameters as described above. Another issue is that Schoenfeld's formula assumes identical censoring distributions in the treatment groups, see Schoenfeld (1981). This assumption is well justified for time to an all-encompassing endpoint and, technically, it lends itself to a particularly simple approximation of the covariation process of the logrank statistic underlying Schoenfeld's formula. It does, however, have further implications in the presence of competing events. We will illustrate this for the simplistic assumption of constant event-specific hazards of experiencing the interesting event recovery in treatment and control groups, α 01T and α 01C , and of experiencing the competing event death without prior recovery in treatment and control groups, α 02T and α 02C .
The event-specific hazard ratios of recovery and of death without prior recovery are then given by θ ES = α 01T /α 01C and θ ES−CE = α 02T /α 02C . Under the constant hazards assumption, the cumulative probability of recovery in treatment group k, k = T, C, at time t is given by
$$F_{1k}(t) = \frac{\alpha_{01k}}{\alpha_{01k} + \alpha_{02k}}\left(1 - \exp\{-(\alpha_{01k} + \alpha_{02k})\,t\}\right),$$
and the cumulative probability of death without prior recovery in treatment group k, k = T, C, at time t is given by
$$F_{2k}(t) = \frac{\alpha_{02k}}{\alpha_{01k} + \alpha_{02k}}\left(1 - \exp\{-(\alpha_{01k} + \alpha_{02k})\,t\}\right).$$
Table 1 shows, for different scenarios of assumed event-specific hazards of recovery and death in treatment and control groups and the associated event-specific hazard ratios of recovery and death, the resulting cumulative event probabilities at day 28 and the resulting subdistribution hazard ratios at day 28. Parameters were chosen to reflect scenarios similar to those in the recently published randomized clinical trials on COVID-19 therapies, where observed probabilities of recovery were around 0.5 to 0.8 and observed probabilities of mortality were around 0.15 to 0.25 (Beigel et al., 2020; Cao et al., 2020; Li et al., 2020; Wang et al., 2020).
Note that the aim of the table is to illustrate possible constellations of the situation at hand, including some for which one would not plan a trial. To illustrate, when the event-specific recovery hazards in treatment and control are identical (θ ES = 1), a decreasing effect of treatment as compared to control on the event-specific death hazard (θ ES−CE < 1) leads to an increased cumulative recovery probability (θ SD (28) > 1), whereas an increasing effect of treatment as compared to control on the event-specific death hazard (θ ES−CE > 1) leads to a decreased cumulative recovery probability (θ SD (28) < 1). Clearly, one would not plan a trial assuming the latter scenario, but it does illustrate that any competing events analysis is incomplete without a look at the competing event.
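The direction of these effects can be checked numerically. The following Python sketch assumes constant event-specific hazards, uses the closed-form cumulative event probabilities that follow from that assumption, and takes the ratio log(1 − F 1T (28)) / log(1 − F 1C (28)) as the subdistribution hazard ratio at day 28. The hazard values and function names are hypothetical and chosen only for illustration; they are not entries of the table.

```python
import math

def cumulative_incidences(a01, a02, t=28.0):
    """Return (F1(t), F2(t)) for constant event-specific hazards.

    a01: hazard of recovery, a02: hazard of death without prior recovery
    (both per day); t: time point in days.
    """
    total = a01 + a02
    prob_any_event = 1.0 - math.exp(-total * t)
    return a01 / total * prob_any_event, a02 / total * prob_any_event

def subdistribution_hr(f1_treatment, f1_control):
    """Subdistribution hazard ratio implied by two cumulative recovery probabilities."""
    return math.log(1.0 - f1_treatment) / math.log(1.0 - f1_control)

# Hypothetical hazards per day: identical recovery hazards in both groups (theta_ES = 1)
a01 = 0.04
a02_control = 0.01

f1_control, f2_control = cumulative_incidences(a01, a02_control)

results = {}
for theta_es_ce in (0.5, 1.0, 2.0):  # treatment effect on the death hazard only
    f1_treatment, f2_treatment = cumulative_incidences(a01, theta_es_ce * a02_control)
    results[theta_es_ce] = subdistribution_hr(f1_treatment, f1_control)
    print(f"theta_ES-CE = {theta_es_ce}: F1T(28) = {f1_treatment:.3f}, "
          f"F2T(28) = {f2_treatment:.3f}, theta_SD(28) = {results[theta_es_ce]:.3f}")
```

With these illustrative values the recovery hazard is never changed, yet the subdistribution hazard ratio moves away from one as soon as the death hazard differs between the groups.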
It is tempting to compare the magnitudes of θ ES and θ ES−CE with that of θ SD (28). A situation of particular interest, not just for this comparison, arises when there is no treatment effect on the event-specific hazard of the competing event, θ ES−CE = 1. To begin, recall that any event-specific hazards analysis is performed by handling observed competing events of the other type as censorings. Hence, assuming θ ES−CE = 1 complies with the assumption of equal censoring mechanisms in the groups for using Schoenfeld's formula. Next, a proportional subdistribution hazards model will, in general, be misspecified assuming proportional event-specific hazards as a consequence of (7). However, it has been repeatedly noted that θ̂ ES ≈ θ̂ SD if θ ES−CE ≈ 1 (Beyersmann et al., 2007; Saadati et al., 2018). This is mirrored in the table, in that scenarios with θ ES−CE = 1 find comparable values of θ ES and θ SD (28). Note, however, that θ̂ SD will estimate a time-averaged subdistribution hazard ratio, averaged over the whole time span, computation of which requires numerical approximations (Beyersmann et al., 2009). Equality (7) also illustrates that event-specific and subdistribution hazards operate on different scales, and many authors have argued that the subdistribution hazard scale is more difficult to interpret, see Andersen and Keiding (2012) for an in-depth discussion. We therefore refrain from further comparing the magnitudes of the different effect measures and rather continue with considering their impact on sample sizes following from Schoenfeld's formula.
Table 1: Event-specific hazard ratios and the subdistribution hazard ratio at time 28 with respect to recovery for different scenarios under the constant hazard assumption
For sample size planning of clinical trials where competing events exist, assumptions are usually based on the expected cumulative event probabilities (Schulgen et al., 2005; Latouche et al., 2013; Baayen et al., 2019; Tai et al., 2018). Under the constant event-specific hazards assumption for both the recovery hazard and the hazard of death without prior recovery, the underlying hazards can be calculated from the cumulative event probabilities via equations (15) and (16) (Tai et al., 2018). Table 2 contrasts, for some scenarios of cumulative event probabilities similar to those of some recently published randomized clinical trials on COVID-19 therapies, the subdistribution recovery hazard ratio with the event-specific recovery hazard ratio calculated from the cumulative event probabilities under the constant event-specific hazards assumption. Additionally, it shows which sample sizes would result if planning addresses the subdistribution recovery hazard ratio, the event-specific recovery hazard ratio, or the odds ratio (of the binary endpoint recovery until day 28) for a randomized clinical trial which aims to show superiority of treatment as compared to control with respect to recovery from COVID-19 with a two-sided type I error rate of 0.05 and power 0.8.
Table 2: Subdistribution recovery hazard ratio and odds ratio (OR) at time 28, and event-specific hazard ratio derived from cumulative event probabilities under the constant event-specific hazard assumption, and resulting sample size when chosen as parameter for study planning with two-sided type I error rate of 0.05 and power 0.8.
Table 2 invites some discussion. To begin, we reiterate that Schoenfeld's formula assumes identical censoring mechanisms in the treatment groups. This is formally fulfilled when planning an analysis of θ ES when θ ES−CE = 1. In this case, a beneficial (harmful) effect on θ ES directly translates into a beneficial (harmful) effect on the cumulative recovery probability. If the assumption of identical censoring mechanisms is violated, the reported sample sizes should serve as a starting point for simulation-based sample size planning in practice. For the subdistribution approach, Latouche et al. (2004) find the use of Schoenfeld's formula to be quite reliable. This is of relevance for complete data on [0, 28] with τ = 28 as before and different probabilities of death F 2 (28) between groups. Here, the approach to handle deaths before time 28 as censorings at day 28 would imply identical (no) censoring on [0, 28), but different censoring at time 28.
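To make the planning steps concrete, the following Python sketch back-calculates constant event-specific hazards from assumed cumulative event probabilities at day 28 and then applies Schoenfeld's formula to both the event-specific and the subdistribution recovery hazard ratio. The planning values, the 1:1 allocation, and all function names are hypothetical assumptions made for illustration; they are not entries of Table 2, and the subdistribution hazard ratio is taken as log(1 − F 1T (28)) / log(1 − F 1C (28)).

```python
import math
from statistics import NormalDist

TAU = 28.0  # follow-up in days

def hazards_from_probabilities(f1, f2, tau=TAU):
    """Back-calculate constant hazards (recovery, death) from F1(tau) and F2(tau)."""
    total_hazard = -math.log(1.0 - f1 - f2) / tau
    return total_hazard * f1 / (f1 + f2), total_hazard * f2 / (f1 + f2)

def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.8, p=0.5):
    """Required number of recovery events for a two-sided level-alpha test."""
    u = NormalDist().inv_cdf
    return (u(1 - alpha / 2) + u(power)) ** 2 / (p * (1 - p) * math.log(hazard_ratio) ** 2)

# Hypothetical planning assumptions at day 28 (recovery, death) per group
f1_c, f2_c = 0.60, 0.20
f1_t, f2_t = 0.70, 0.15

a01_c, a02_c = hazards_from_probabilities(f1_c, f2_c)
a01_t, a02_t = hazards_from_probabilities(f1_t, f2_t)

theta_es = a01_t / a01_c        # event-specific recovery hazard ratio
theta_es_ce = a02_t / a02_c     # event-specific death hazard ratio
theta_sd = math.log(1 - f1_t) / math.log(1 - f1_c)  # subdistribution hazard ratio

psi = 0.5 * f1_t + 0.5 * f1_c   # probability of observing a recovery, no censoring

for label, theta in (("event-specific", theta_es), ("subdistribution", theta_sd)):
    events = schoenfeld_events(theta)
    print(f"{label}: ratio = {theta:.3f}, events = {math.ceil(events)}, "
          f"patients = {math.ceil(events / psi)}")
print(f"death hazard ratio theta_ES-CE = {theta_es_ce:.3f}")
```

Because θ ES−CE differs from one in this hypothetical scenario, the assumption of identical censoring mechanisms does not hold for the event-specific analysis, so the resulting numbers would only be a starting point for simulation-based planning.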
Next, analysis and sample size planning should not be guided by the required number of patients but by the parameter of interest. To this end, we reiterate that subdistribution times and, in particular, subdistribution hazards underlie the analyses of recent COVID-19 trials as outlined earlier, and the table illustrates consequences of this choice. In the table, the entries F 1T (28), F 1C (28), θ SD (28) and OR(28) do not change, i.e., are assumed to be the same across all scenarios, but the entries for θ ES and θ ES−CE do change, reflecting different entries F 2T (28), F 2C (28). It is important to note that θ ES and θ ES−CE can be modelled freely, i.e., independently of each other, but, of course, the competing event probabilities do not share this property. In either case, the table illustrates that careful planning requires assumptions on the event-specific hazard or on the cumulative event probability of the competing event.
In some of the recently published randomized trials on the treatment of COVID-19 (Cao et al., 2020; Li et al., 2020), sample size planning was performed in terms of assumed median times to clinical improvement. Both Cao et al. (2020) and Li et al. (2020) assumed for the control group a median time to clinical improvement of 20 days and a reduction of this time to 12 days in the active treatment group. For a two-sided significance level of α = 0.05 with a power of 80%, this resulted for the trial of Cao et al. (2020) in a total sample size of 160 patients under the assumption that 75% of the patients would reach clinical improvement, and for the trial of Li et al. (2020) in a total sample size of 200 patients under the assumption that 60% of the patients would reach clinical improvement, both up to day 28. The proportions of patients with clinical improvement by day 28 were thus assumed to differ between the two trials although identical median times to clinical improvement had been assumed. This could be due to different assumptions regarding the expected mortality rates, which were not mentioned explicitly. From these specifications we speculate that an exponential distribution for time to clinical improvement had been assumed, leading to an event-specific hazard ratio of 1.66 and a required number of 120 patients experiencing the event clinical improvement, which leads to the above-mentioned patient numbers under the assumed proportions of clinical improvement by day 28. We note that in the statistical analysis of the trials, parameters were estimated from the subdistribution time, i.e. the subdistribution hazard ratio and median times to clinical improvement based on quantity (9), which is not quite consistent with the methods used for sample size calculation.
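The arithmetic behind this speculation can be retraced with a short Python sketch. It assumes exponentially distributed times to clinical improvement (so that the hazard ratio equals the inverse ratio of the medians), 1:1 allocation, and Schoenfeld's formula for the number of events; the function and variable names are illustrative only.

```python
import math
from statistics import NormalDist

def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.8, p=0.5):
    """Required number of events for a two-sided level-alpha test (Schoenfeld's formula)."""
    u = NormalDist().inv_cdf
    return (u(1 - alpha / 2) + u(power)) ** 2 / (p * (1 - p) * math.log(hazard_ratio) ** 2)

# Assumed median times to clinical improvement in days (control, treatment)
median_control, median_treatment = 20.0, 12.0

# Under an exponential model the hazard is log(2) / median,
# so the hazard ratio (treatment vs. control) is the inverse ratio of the medians.
hazard_ratio = median_control / median_treatment

events = round(schoenfeld_events(hazard_ratio))

# Divide by the assumed probability of clinical improvement by day 28
patients_75 = round(events / 0.75)  # assumption reported for Cao et al. (2020)
patients_60 = round(events / 0.60)  # assumption reported for Li et al. (2020)

print(f"hazard ratio = {hazard_ratio:.3f}, events = {events}, "
      f"patients = {patients_75} (75%) and {patients_60} (60%)")
```

The hazard ratio obtained from the medians is 20/12, which is about 1.67 and matches the value quoted above up to rounding.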
If the aim is to increase, say, the number of recoveries and to obtain these recoveries in a shorter time, the primary analysis may target the cumulative recovery probability as a function of time. Assuming that all or almost all patients experience one of the competing outcomes on [0, τ], it will suffice to target this probability, because an increase of the recovery probability would then protect against a harmful effect on mortality. One possibility to demonstrate both an increase of the cumulative recovery probability and a shorter time to recovery is to establish a subdistribution hazard ratio larger than one. However, the interpretation of the subdistribution hazard is not straightforward, and an alternative would be a transformation model of the cumulative recovery probability using a logistic link or a comparison of the cumulative recovery probabilities (Eriksson et al., 2015) using confidence bands.
When the cumulative recovery probability on [0, τ ] is the target parameter, we see in Table 2 no large difference in the calculated sample size for the subdistribution hazard ratio based on (13) and (14) as compared to the calculated sample size for the odds ratio of the binary endpoint based on formula (2) in Hsieh et al. (1998). When no competing events are present, it had been shown by Annesi et al. (1989) that efficiency of an analysis with logistic regression is high as compared to an analysis with Cox regression in the situation of a low event rate. We are not aware of a similar efficiency investigation comparing the Fine and Gray model with the logistic model in the presence of competing events. Formula (7) indicates that the subdistribution hazard is lower than the event-specific hazard, so arguments related to low event rates could possibly translate.

Sample size recalculation
As we have seen in Section 4.1, the sample size or power calculations rely on a number of assumptions. In particular in an epidemic situation, there is no or very little prior knowledge regarding relevant parameters. These include the treatment effect but also potentially a range of nuisance parameters such as event probabilities regarding events of interest such as recovery, or competing events such as death. Sample size recalculation procedures were suggested to deal with this type of uncertainty and to make trials more robust to parameter misspecifications in the planning phase (see e.g.  for a recent overview). Generally, two broad classes of procedures are distinguished, namely sample size recalculation based on nuisance parameters and effect-based sample size recalculation.
Designs with sample size recalculation based on nuisance parameters are also known as internal pilot study designs (Friede and Kieser, 2006). The general procedure consists of the following steps: (i) a conventional sample size calculation is carried out at the design stage; (ii) part way through the trial the nuisance parameters are estimated from the available data and the sample size recalculated accordingly; and (iii) in the final analysis the combined sample of the internal pilot study and the remaining trial are analyzed. Nuisance parameters such as event probabilities might relate to the control group or the overall study population across the treatment groups. The latter can obviously be estimated from non-comparative data during the ongoing trial and does not require any unblinding. Therefore, it is often the preferred option, in particular in trials with regulatory relevance (EMA, 2007;FDA, 2018).
With a binary outcome the overall event probability can be considered the nuisance parameter, which can be estimated from the overall sample. Gould (1992) provides sample size recalculation formulas based on the overall event probability for the odds ratio (considered above) as effect measure but also for relative risks and risk differences. The latter was also studied in more detail by Friede and Kieser (2004). The specification of the treatment effect is relevant here, since it is kept fixed. With the odds ratio as effect measure and event probabilities below (above) 0.5, a lower (higher) than expected event probability results in a sample size increase, whereas with a risk difference the sample size would be decreased. Under the proportional odds model the blinded procedure for binary outcomes can be extended to ordinal outcomes. Rather than estimating the event probability from the sample pooled across the treatment arms, the distribution of the ordinal outcome is assessed in the pooled sample (Bolland et al., 1998). Guidance on blinded sample size recalculation procedures in time-to-event trials is provided in  and references therein. In designs with flexible follow-up times, the procedures would consider the recruitment, event and censoring processes. In the situation considered here, trials are likely to use a fixed follow-up design following all patients up to τ, say τ = 28 days. It then follows from Section 4.1 that the probabilities of the event of interest and of the competing event would be estimated by the Aalen-Johansen estimator at interim. These findings would then be used to update the initial sample size calculation.
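A minimal Python sketch of such a blinded update is given below. It assumes a fixed 28-day follow-up without censoring, so that the pooled Aalen-Johansen estimate of the recovery probability reduces to the pooled proportion of recoveries, and it keeps the required number of events from the planning stage fixed. The interim data, the coding of outcomes, and the function name are hypothetical.

```python
import math

def recalculated_sample_size(events_required, interim_outcomes):
    """Blinded update of the total sample size N = E / Psi.

    events_required: number of recovery events from the initial planning.
    interim_outcomes: pooled (blinded) day-28 outcomes of patients with
        completed follow-up, coded 1 = recovery, 2 = death without prior
        recovery, 0 = neither event by day 28.
    """
    n_interim = len(interim_outcomes)
    n_recovered = sum(1 for outcome in interim_outcomes if outcome == 1)
    if n_recovered == 0:
        raise ValueError("no recoveries observed at interim; Psi cannot be estimated")
    psi_hat = n_recovered / n_interim
    # Integer arithmetic before the division avoids rounding artefacts in ceil()
    n_total = math.ceil(events_required * n_interim / n_recovered)
    return psi_hat, n_total

# Hypothetical interim data: 40 patients pooled over both arms
interim = [1] * 24 + [2] * 8 + [0] * 8

# Suppose 120 events were required at the planning stage
psi_hat, n_new = recalculated_sample_size(120, interim)
print(f"pooled recovery probability = {psi_hat:.2f}, updated total sample size = {n_new}")
```

Only pooled data enter the calculation, so no unblinding is needed; whether the updated sample size may fall below the initially planned one is a design decision not covered by this sketch.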
In so-called internal pilot study designs, the sample size recalculation is typically carried out at a single time point during the study. Since the blinded procedure seems to be uncritical in terms of logistics and type I error rate inflation, repeated recalculations based on blinded data could be considered. In fact, the nuisance parameters could even be monitored in a blinded fashion from a certain point in time onwards. This is also known as blinded continuous monitoring (Friede and Miller, 2012). It is typically done in event-driven trials, where the total number of events across both treatment arms is monitored. This principle can be transferred to other types of outcomes such as recurrent events.
Group sequential designs belong to the class of designs with effect-based sample size adaptation. They are used in many disease areas including oncology as well as cardiovascular and cardiometabolic research. For binary outcomes or ordinal outcomes under the proportional odds model the procedures are well established (Jennison and Turnbull, 2000). Logan and Zhang (2013) described group sequential procedures in the presence of competing events. Classical group sequential designs, however, must proceed in a prespecified manner, and the size of the design stages must not be based on observed treatment effects unless prespecified weights for the design stages are used (Cui et al., 1999). The latter procedure is equivalent to the inverse normal combination function by Lehmacher and Wassmer (1999). Some issues in this type of design with time-to-event outcomes were raised (Bauer and Posch, 2004), but are not a concern in designs with fixed follow-up time considered here as long as the patients are grouped into design stages in the analysis (Friede et al., 2011). For a very recent review on adaptive designs for COVID-19 intervention trials see .
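The role of the prespecified weights can be illustrated with a small Python sketch of the inverse normal combination of two stage-wise one-sided p-values. It assumes two stages with disjoint patient cohorts and weights fixed before the trial with squared weights summing to one; early stopping boundaries are deliberately left out, and the function name and example p-values are hypothetical.

```python
import math
from statistics import NormalDist

def inverse_normal_combination(p_stage1, p_stage2, w1=math.sqrt(0.5)):
    """Combine two one-sided stage-wise p-values with prespecified weights.

    w1 is the weight of stage 1; the weight of stage 2 is chosen such that
    w1**2 + w2**2 = 1. Both weights must be fixed before the trial starts.
    Returns the combined z-statistic and the combined one-sided p-value.
    """
    w2 = math.sqrt(1.0 - w1 ** 2)
    normal = NormalDist()
    z = w1 * normal.inv_cdf(1.0 - p_stage1) + w2 * normal.inv_cdf(1.0 - p_stage2)
    return z, 1.0 - normal.cdf(z)

# Hypothetical stage-wise p-values from patients grouped into two design stages
z_combined, p_combined = inverse_normal_combination(0.04, 0.03)
print(f"combined z = {z_combined:.3f}, combined one-sided p = {p_combined:.4f}")
```

Because the weights do not depend on the observed data, the size of the second stage may be changed at the interim analysis without changing how the two stages are combined.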

Discussion
In the COVID-19 pandemic the fast development of safe and effective treatments is of paramount importance. Severe forms of COVID-19 require hospitalization and in some cases intensive care. In these settings, recovery, mechanical ventilation, mortality etc. are relevant outcomes. From a statistical viewpoint different approaches to their analysis might be meaningful. Here we argued that a successful treatment of COVID-19 patients (i) increases the probability of a recovery within a certain time interval, say 28 days; (ii) aims to expedite recovery within this time frame; and (iii) does not increase mortality over this time period. We made some recommendations regarding the design and analysis of COVID-19 trials with such outcomes. Since there is no previous experience with COVID-19, sample size calculations have to be informed by data from related diseases. This results in considerable uncertainty which can be mitigated by appropriate adaptive designs including blinded sample size reestimation.
Here we considered trials evaluating treatments of patients suffering from severe forms of COVID-19. Of course, ongoing trials in other disease areas have also been affected by the pandemic. The issues and potential solutions are discussed in a recent paper by Kunz et al. (2020). Furthermore, we did not consider vaccine or diagnostic trials. Also, we assumed that event times were recorded on a continuous scale. In practice, however, this is strictly speaking not the case, as event times might be reported in terms of days from randomization. In particular with shorter follow-up times, this type of discreteness could be dealt with using appropriate models. For an overview, we refer the reader to Schmid and Berger (2020).

Conflict of Interest
The authors have declared no conflict of interest.
As a consequence, canceling the appropriate terms leads to N 01 (t)/n, i.e., the number of type 1 events on [0, t] divided by the sample size. It is well known that the Aalen-Johansen estimator also equals N 01 (t)/n in the absence of censoring, which completes the argument.
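The last statement can be verified on a small example. The following Python sketch implements the Aalen-Johansen estimator of the cumulative probability of a type 1 event in its usual product-limit form and compares it with N 01 (t)/n on uncensored data. The data set and the function name are hypothetical.

```python
def aalen_johansen_type1(times, causes, t):
    """Aalen-Johansen estimate of the cumulative probability of a type 1 event at time t.

    times: observed times; causes: 1 = event of interest, 2 = competing event,
    0 = censored observation.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    event_free = 1.0   # Kaplan-Meier estimate of being event-free just before the current time
    cumulative = 0.0
    i = 0
    while i < len(data) and data[i][0] <= t:
        current = data[i][0]
        tied = [cause for (time, cause) in data[i:] if time == current]
        d_type1 = tied.count(1)
        d_any = sum(1 for cause in tied if cause != 0)
        cumulative += event_free * d_type1 / n_at_risk
        event_free *= 1.0 - d_any / n_at_risk
        n_at_risk -= len(tied)
        i += len(tied)
    return cumulative

# Hypothetical uncensored data: every patient has a recovery (1) or death (2) by day 28
times = [3, 5, 5, 8, 12, 20, 28, 28]
causes = [1, 2, 1, 1, 2, 1, 1, 2]

t = 10
aj_estimate = aalen_johansen_type1(times, causes, t)
empirical = sum(1 for time, cause in zip(times, causes) if time <= t and cause == 1) / len(times)
print(f"Aalen-Johansen: {aj_estimate:.4f}, N01(t)/n: {empirical:.4f}")
```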